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Abstract 


Sparse  data  models  have  gained  considerable  attention  in  recent  years,  and  their 
use  has  led  to  state-of-the-art  results  in  many  signal  and  image  processing  tasks.  The 
learning  of  sparse  models  has  been  mostly  concerned  with  adapting  the  dictionary 
to  tasks  such  as  classification  and  reconstruction,  optimizing  extrinsic  properties  of 
the  trained  dictionaries.  In  this  work,  we  first  propose  a  learning  method  aimed  at 
enhancing  both  extrinsic  and  intrinsic  properties  of  the  dictionaries,  such  as  the  mutual 
and  cumulative  coherence  and  the  Gram  matrix  norm,  characteristics  known  to  improve 
the  efficiency  and  performance  of  sparse  coding  algorithms.  We  then  use  tools  from 
information  theory  to  propose  a  sparsity  regularization  term  which  has  several  desirable 
theoretical  and  practical  advantages  over  the  more  standard  io  or  t-i  ones.  These  new 
sparse  modeling  components  lead  to  improved  coding  performance  and  accuracy  in 
reconstruction  tasks. 

1  Introduction 

Sparse  modeling  calls  for  constructing  a  succinct  representation  of  some  data  as  a  combina¬ 
tion  of  a  few  typical  patterns  (atoms)  learned  from  the  data  itself.  Significant  contributions 
to  the  theory  and  practice  of  learning  such  collections  of  atoms  (usually  called  dictionaries 
or  codebooks),  e.g.,  [1,  12,  24],  and  of  representing  the  actual  data  in  terms  of  them,  e.g., 
[6,  7,  8],  have  been  developed  in  recent  years,  leading  to  state-of-the-art  results  in  many 
signal  and  image  processing  tasks  [10,  18,  22,  20,  25].  We  refer  the  reader  for  example  to 
[3]  for  a  recent  review  on  the  subject. 

In  all  cases,  the  actual  dictionary  plays  a  critical  role.  Current  techniques  for  obtaining 
such  dictionaries  involve  the  optimization  of  their  extrinsic  properties  in  terms  of  the  task 
to  be  performed  (i.e.,  representation  [12],  denoising  [10,  22],  or  classification  [20]).  However, 
theoretical  results  addressing  the  success  in  recovering  sparse  signals,  as  well  as  the  efficiency 
of  sparse  coding  algorithms,  are  related  to  intrinsic  properties  of  the  dictionary  such  as  the 
mutual  coherence,  the  cumulative  coherence,  and  the  Gram  matrix  norm  of  the  dictionary 
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[9,  30].  We  will  provide  precise  definitions  of  these  magnitudes  in  the  sequel.  Addressing 
these  important  intrinsic  properties  is  one  of  the  goals  of  the  work  here  presented. 

A  critical  component  of  sparse  modeling  is  the  actual  sparsity  of  the  solution,  which 
is  controlled  by  some  critical  model  parameters.  Choosing  the  optimal  values  of  these 
parameters  for  the  actual  signals  to  model  and  the  problem  at  hand  is  a  challenging  task. 
Several  solutions  to  this  problem  have  been  proposed,  ranging  from  the  automatic  tuning  of 
the  parameters  [15]  to  Bayesian  hierarchical  models,  where  these  parameters  are  themselves 
considered  as  random  variables  [14,  15,  31].  In  this  paper  we  address  this  challenge,  and  at 
the  same  time  further  generalize  the  standard  sparsifying  penalty  functions,  exploiting  tools 
from  information  theory. 

The  first  contribution  of  this  work  is  the  explicit  incorporation  of  a  new  term  in  the  sparse 
modeling  formulation  that  induces  the  desired  intrinsic  properties  of  low  mutual  coherence 
and  low  Gram  matrix  norm  in  the  learned  dictionaries.  We  show  how  this  leads  to  the  better 
performance  of  coding  algorithms.  The  learned  dictionaries  also  exhibit  desirable  extrinsic 
properties  such  as  reduced  overfitting,  a  direct  consequence  of  the  reduced  coherence  between 
the  dictionary  atoms.  Our  second  contribution  is  the  substitution  of  the  traditional  t'o  or 
i\  sparsity-inducing  priors  by  one  which  we  derive  using  information-theoretic  tools.  This 
prior  has  several  desirable  theoretical  and  practical  properties  such  as  statistical  consistency, 
improved  robustness  to  outliers  in  the  data,  and  leads  to  a  better  sparse  reconstruction  than 

and  £i-based  techniques  in  practice. 

The  rest  of  this  paper  is  organized  as  follows:  in  Section  2  we  introduce  the  standard 
framework  of  sparse  modeling.  Sections  3  to  5  are  dedicated  to  the  derivation  of  our  proposed 
model.  Section  6  gives  details  on  the  implementation  of  the  learning  algorithm.  Sections  7 
and  8  present  experimental  results  showing  the  importance  of  the  proposed  sparse  model 
for  image  representation.  Concluding  remarks  are  given  in  Section  9. 


2  Sparse  modeling 

Let  X  G  be  a  set  of  N  column  data  vectors  Xj  G  K",  D  G  be  a  dictionary 

of  K  atoms  represented  as  columns  G  K",  and  A  =  {aij}  G  G  K^,  be  a 

set  of  reconstruction  coefficients  such  that  X  =  DA.  For  each  j  =  1, . . . ,  A  we  define  the 
active  set  of  Aj  as  Aj  =  {i  :  aij  ^  0},  and  ||Aj||p  =  \Aj\  as  its  cardinalty.  The  goal  of 
sparse  modeling  is  to  design  a  dictionary  D  such  that  X  =  DA  with  ||Aj||p  sufficiently 
small  (usually  below  some  threshold)  for  all  or  most  data  samples  X^  .  For  a  fixed  D,  the 
actual  computation  of  A  is  called  sparse  coding  (SC). 

We  begin  our  discussion  with  the  standard  £1  penalty  modeling  problem, 

(A*,  D*)  =  argmm  {||X  -  DA]]^  +  A  ||A||,}  (1) 

where  Ij-llp  denotes  Frobenius  norm.  The  cost  function  to  be  minimized  in  (1)  consists  of  a 
quadratic  fitting  term  and  an  £i  regularization  term  for  each  column  of  A,  the  balance  of  the 
two  being  defined  by  the  penalty  parameter  A.  The  li  norm  is  used  as  an  approximation  to  Iqi 
making  the  problem  convex  in  A  while  still  encouraging  sparse  solutions  [3].  Furthermore, 
it  has  been  shown  that  under  certain  conditions  on  D  and  X,  the  solutions  to  the  £i-based 
and  £o-based  sparse  coding  problems  coincide,  e.g.,  [4].  The  dictionaries  learned  with  the 
model  here  proposed  explicitly  encourage  such  conditions. 
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3  Learning  incoherent  dictionaries 

Most  techniques  for  dictionary  learning  are  concerned  with  the  extrinsic  properties  of  the 
learned  dictionaries,  e.g.,  requiring  small  reconstruction  errors  while  simultaneously  impos¬ 
ing  a  sparsity  constraint  [10,  12,  22,  24].  However,  recent  results  [3,  7,  30]  in  sparse  coding 
theory  indicate  that  certain  intrinsic  properties  of  the  dictionaries  have  a  direct  impact  on 
the  performance  of  coding  algorithms.  Optimizing  for  such  intrinsic  properties  is  the  goal 
of  this  section. 

In  the  context  of  compressed  sensing,  the  idea  of  learning  the  sensing  matrix,  not  quite 
the  dictionary  as  here  proposed,  targeted  at  the  optimization  of  intrinsic  (effective)  dictio¬ 
nary  properties,  has  been  recently  proposed  in  [9]  (see  also  [28] ) .  It  has  been  also  shown  that 
designing  the  sensing  matrix  can  accelerate  the  sparse  coding  optimization  [17].  In  this  sec¬ 
tion  we  propose  to  learn  dictionaries  that  in  addition  to  achieving  small  reconstruction  errors 
with  sparse  representations,  have  intrinsic  properties  that  lead  to  important  performance 
improvements. 

Let  us  assume  that  the  atoms  are  normalized,  ||Dfc||  =  1.  The  L-cumulative  coherence 
of  D  is  defined  as  /iL(D)  =  max  jmax^^j  J2jej  ^  ^  {Ij  •  ■  ■ ;  K}  ,  |  J|  =  l|  [30]. 

When  L  is  not  specified,  we  assume  it  to  be  its  maximum  value  K  —  1,  so  that  /1(D)  = 
/tif_i(D).  Assuming  there  is  a  true  sparse  representation  for  the  data,  the  success  in 
recovering  it  by  approximating  the  £q  sparse  formulation  using  Orthogonal  Matching  Pursuit 
(OMP)  [23],  or  by  solving  the  Basis  Pursuit  (BP)  [6]  problem  [30],  is  influenced  by  these 
quantities,  which  we  optimize  for  below. 

Another  important  property  of  D  is  its  Gram  matrix  G  =  D^D,  whose  matrix  /2-norm 
p(D)  is  known  to  influence  the  speed  of  convergence  of  popular  shrinkage-based  coding 
algorithms  [7,  11].  We  recall  that  p(D)  =  ||G||2  =  max{|7|  :  7  S  7(G)},  where  7(G)  denotes 
the  set  of  eigenvalues  of  G.  Let  7max  denote  the  eigenvalue  with  largest  absolute  value,  that 
is,  p(D)  =  |7niax|-  By  the  Gershgorin  Gircle  Theorem  [16,  pp.  320-321],  we  have  that 
7(G)  C  where  Cfc  =  {y  :  \y  -  gkk\  <  J2r^k  \9kr\}  and  gkr  =  D^D^  are  the  elements 

of  G.  For  normalized  atoms  we  have  for  all  k  that  gkk  =  1,  and  thus  J2r^k  \9kr\  <  /1(D). 
Plugging  these  values  in  the  Gershgorin  Theorem  we  obtain  that  p(D)  =  |7max|  <  1-1- /1(D). 
Therefore,  we  can  simultaneously  reduce  p(D)  and  /1(D)  by  imposing  small  off-diagonal 
elements  in  G.  This  suggests  the  addition  of  the  following  term  to  Equation  (1) 

||d^d-Ik||^  (2) 

where  Ik  is  the  KxK  identity  matrix.  In  Section  8  we  will  show  that  this  new  term  helps 
to  improve  reconstruction  accuracy  and  speed.  The  final  proposed  cost  function  is  shown 
later  in  Section  5  after  a  further  modification  to  Equation  (1),  which  we  discuss  next. 

4  Universal  models  for  sparse  coding 

Sparse  models  are  often  presented  in  a  probabilistic  framework  [19,  24].  From  this  point 
of  view,  the  solution  for  D*  in  (1)  can  be  seen  as  an  approximate  solution  to  a  Maximum 
Likelihood  estimation  of  D  given  X,  that  is,  with  a  slight  abuse  of  notation, 

D*  =  argm^|p(X|D)  =  J  p(X,A|D)dA  =  J  p(X|A,  D)p(A|D)dA| ,  (3) 


3 


where  fl  represents  the  space  where  A  takes  its  values.  In  this  framework,  A  is  considered 
to  be  a  hidden  state  variable  which  is  marginalized  in  the  integral.  A  direct  solution  of  the 
integral  in  (3)  is  generally  not  possible,  so  approximate  solutions  are  sought  instead.  One 
such  approximation,  followed  in  [24],  is  to  maximize  the  mode  of  the  integrand  p(X,  AjD) 
instead  of  the  whole  integral,  which  occurs  at  (X,  A*) .  This  is  roughly  justified  by  assuming 
that  p(X,  A|D)  is  unimodal  and  highly  peaked,  so  that  its  value  at  the  mode  accounts  for 
most  of  the  value  of  the  integral.  The  maximization  is  then  carried  on  in  the  logarithmic 
scale,  resulting  in 


D 


=  argnmx|m|ix{logp(X|A,D)  +logp(A|D)}|  . 


(4) 


The  formulation  (1)  is  then  obtained  from  (4)  considering  the  reconstruction  error  to  be  IID 
Gaussian  with  mean  0  and  variance  cr^,  p(X|A,  D)  oc  exp(— 11^"  DA]]^);  and  an  IID 
Laplacian  prior  with  mean  0  and  parameter  9  on  the  reconstruction  coefficients,  which  is 
furthermore  assumed  to  be  independent  of  D,  p(A|D)  ~  p(A|0)  cx  Equation  (1) 

follows  by  taking  the  logarithms  of  both  priors  and  factorizing  into  A  =  2a^9. 

Considering  D  fixed  we  now  turn  to  the  problem  of  finding  the  aforementioned  mode 
ofp(X,A|D).  For  a  Laplacian  prior,  this  is  the  sparse  coding  problem  that  appears  in  (1) 
when  optimizing  only  over  A.  When  considering  the  statistical  modeling  of  small  patches 
from  natural  images,  the  assumption  that  transform  coefficients  (e.g.,  dct  or  wavelet)  are 
well  modeled  by  a  Laplacian  distribution  is  widely  accepted.  Assuming  the  coefficients  of 
A  to  be  IID  leads  to  the  sparse  coding  problem  in  (1),  as  discussed  above.  Even  under  these 
strong  assumptions,  the  selection  of  an  optimal  value  of  A  (or  9  for  known  cr^),  is  already  a 
challenging  problem  (see  [15]  for  example). 

It  can  be  argued  that  a  better  model  would  be  one  with  different  Laplacian  parameters 
for  coefficients  associated  to  different  atoms  (that  is,  different  rows  of  A),  as  it  happens  with 
DCT  coefficients  associated  with  different  basis  functions.  Even  so,  an  IID  assumption  for  the 
rows  of  A  also  seems  inadequate  for  natural  images,  whose  statistics  vary  across  different 
regions  (e.g.,  textures,  boundaries).  Therefore,  instead  of  tuning  for  a  fixed  A  or  multiple 
values  of  A,  we  look  for  a  more  general  prior  that  can  fit,  without  knowing  A  in  advance, 
almost  as  well  as  a  Laplacian  whose  parameter  9  was  tuned  for  that  specific  instance  of 
A.  The  answer  to  such  problem  is  given  by  the  information-theoretic  concept  of  universal 
coding  which  lies  at  the  core  of  the  Minimum  Description  Length  (mdl)  principle  [2]:  given 
a  set  of  probability  models  M  =  {p(j0)  :  0  S  0}  parameterized  by  some  9,  it  is  possible 
to  construct  a  probability  model  g(-)  that  fits  the  data  to  be  modeled  approximately  as 
well  as  the  probability  model  p{-\9)  in  A4  that  best  fits  the  data.  Here  9  is  the  Maximum 
Likelihood  Estimator  (mle)  of  9.  One  way  to  construct  such  a  universal  prior  is  through 
a  Bayesian  mixture.  In  a  Bayesian  mixture  q{‘),  the  probability  of  a  given  coefficient  value 
q{a)  =  Vr{aij  =  a)  is  obtained  by  averaging  the  probability  assigned  to  a  by  p{-\9),  for  all 

6»  e  0, 

q{a)  =  I  p{a\9)w{9)d9,  (5) 

Je 

where  w{9)  is  an  (hyper-)prior  on  9.  In  Bayesian  theory,  w{9)  reflects  the  prior  belief 
on  the  values  of  9.  This  is  the  main  idea  behind  sparse  Bayesian  coding  works  such  as 
[14,  29].  However,  in  universal  coding/MDL  theory  such  interpretation  is  not  necessary,  and 
it  can  be  shown  that  any  smooth  choice  w{9)  is  enough  to  guarantee  the  universality  of  the 
resulting  mixture  [2].  This  allows  us  to  obtain  a  closed  form  solution  of  (5)  for  Laplacian 
p{-\9)  in  the  following  way.  Since  the  Laplacian  is  a  symmetrized  version  of  the  Exponential 
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distribution,  we  can  perform  the  mixture  on  the  Exponential  distribution  and  then  add  back 
the  symmetry  by  substituting  a  with  |a|  and  dividing  the  normalization  constant  by  1/2.  A 
closed  form  solution  of  (5)  can  be  obtained  by  using  the  conjugate  prior  for  the  Exponential, 
which  is  the  Gamma  distribution, 

w{e\K,l3)  = 

where  n  and  (3  are  its  shape  and  scale  parameters  respectively.  By  using  the  Gamma  prior, 
and  later  re-symmetrizing  the  distribution  we  arrive  at  the  following  symmetric  distribution, 

<7MOL(a|/3,  k)  =  K/3”(|a|  -k (6) 

which  we  call  a  Mixture  of  Laplacians  (mol).  Although  the  resulting  prior  has  two  param¬ 
eters  to  deal  with  instead  of  one,  it  will  be  shown  later  in  Section  7.1,  that  a  single  mol 
distribution  can  fit  each  of  the  K  rows  of  A  better  than  K  separate  Laplacian  distributions 
fine-tuned  to  these  rows,  for  a  total  of  K  parameters  to  be  estimated.  Furthermore,  both 
K  and  /3  are  easily  computed  using  the  method  of  moments.  It  is  easy  to  see  that  the  non 
central  moment  of  order  j  of  the  MOL  distribution  is  iXj  =  Thus,  given  sample 

estimates  Ai  =  ^  A2  =  ^ 

K  =  2(^2  -  Ai)/(A2  -  2A?)  and  /3=(k-l)Ai,  (7) 

When  the  MOL  prior  is  plugged  into  (4),  the  following  new  cost  function  is  obtained, 

N  K 

/(X,A|D)  =  ||X-DA||^-br^5]log(|a.,|  +  /3),  (8) 

1  =  1  i=l 

where  r  =  2(T^(k  -|-  1).  The  resulting  logarithmic  non-convex  MOL  regularization  term, 
logp(A|D),  is  known  in  robust  statistics  as  the  Lorentzian  norm,  also  known  to  be  more 
robust  to  outliers  than  the  £i  norm.  We  also  know  from  the  statistics  literature  that  the  MOL 
regularizaton  term  leads  to  consistent  estimators  of  regression  coefficients  which  are  able  to 
identify  the  relevant  variables  in  a  regression  model  (oracle  property)  [13].  This  is  not  the 
case  for  the  regularizer  [31].  This  same  regularizer  has  also  been  recently  proposed  in  the 
context  of  compressive  sensing  [5],  where  it  is  conjectured  to  be  better  than  the  £i-term  at 
recovering  sparse  signals.^  Our  results  in  Section  7  give  evidence  that  this  is  indeed  the  case, 
with  the  direct  consequence  of  a  much  improved  reconstruction  accuracy  of  sparse  data.  We 
also  show  in  Section  7  that  the  MOL  prior  is  much  better  to  model  reconstruction  coefficients 
drawn  from  a  large  database  of  image  patches.  We  will  also  see  next  that  although  the  mol 
regularizer  is  non-convex,  simple  and  effective  methods  are  available  to  solve  the  resulting 
sparse  coding  (or  regression)  problems. 


Remark:  Instead  of  using  a  Gamma  prior  (which  actually  already  fits  the  empirical  statis¬ 
tics  of  image  data  very  good  [26]),  we  can  use  for  example  the  Jeffreys  prior,  which  has 
several  interesting  properties  both  from  the  information-theoretic  and  statistical  points  of 
view.  The  Jeffreys  prior  for  a  given  distribution  is  defined  as 


wj{0) 


VW) 

UVW)dO 


^In  [.')],  the  logarithmic  regularizer  arises  from  approximating  the  io  pseudo-norm  as  a  ^i-normalized 
element-wise  sum. 
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where  I {9)  is  the  Fisher  Information  matrix  of  a  parametric  distribution  of  parameter  9\ 


m 


902 


logp(x|0) 


9=0 


(9) 


For  the  Exponential  distribution  (which  is  the  one-sided  version  of  the  Laplacian)  we  have 
that  I{6)  =  Clearly,  if  we  let  0  =  (0,  oo),  the  integral  in  (9)  evaluates  to  oo.  To  get  a 
proper  integral,  we  restrict  0  to  be  a  closed  interval  [a,  6],  0  <  a  <  6  <  oo.  For  this  choice 
we  get  wj{9)  =  g-  resulting  mixture,  after  being  symmetrized  around  0,  has  the 

following  form 

We  refer  to  this  prior  as  a  Jeffreys  mixture  of  Laplacians  (jol).  Note  that  although  ^jql 
is  not  defined  for  a  =  0,  its  limit  when  a  — >  0  is  finite  and  evaluates  to  2  in(~b/a)  ■  Thus, 
by  defining  (?jol(0)  =  2  in(b/a)  obtain  a  prior  that  is  well  defined  and  continuous  for  all 
a  G  K.  Experimental  results  with  this  prior,  as  well  as  the  full  mathematical  details  of  the 
derivations  here  described  are  provided  in  [27]. 


5  Proposed  sparse  model 

To  the  model  in  (8),  which  replaces  the  more  classical  (1),  we  add  the  term  for  dictionary 
incoherence  introduced  in  Section  3,  leading  to  the  proposed  sparse  model 

N  K  K 

/(X,D,A)  =  ||X-DA||^  +  r^^log(|a,,|  +  /3)  +  C||D^D-I^||^  +  ,?^(||Dfc||2-l)^ 

j—1 i—1  k—1 

(11) 

The  last  term  is  added  as  a  standard  way  to  maintain  the  atom  norms  close  to  one.^  This 
(soft)  normalization  was  empirically  observed  to  yield  better  results  than  a  forced  projection 
of  the  atoms  after  each  update,  which  is  the  approach  followed  in  [1,  12,  24]. 

6  Numerical  optimization 

To  minimize  (11)  we  apply  the  standard  approach  of  alternate  minimization.  We  start 
with  an  arbitrary  initial  dictionary  and  repeat  the  following  sequence  of  updates  until 
convergence 

SC  :  A(*+i)  =  argnnn|/(X,D(*),A)|  and  DU  :  =  argnnn  |/(X,  D,  A(‘+i))|  . 

(12) 

Sparse  coding  (sc)  For  fixed  D,  the  cost  function  in  (11)  is  non-convex  in  A.  We  handle 
this  issue  as  in  [5],  using  a  surrogate  function  technique  where  the  optimal  solution 
to  the  SC  problem  is  the  limit  of  solutions  to  a  sequence  of  subproblems,  where  the  non- 
convex  cost  function  is  approximated  by  convex  functions  that  are  easy  to  solve.  More 
specifically,  we  have  that  A^*+^)  =  lim^^oo{4'(^)},  with  =  argmin,j,  |/i(*^(X,D,'I')|. 

^Note  that  the  term  jjD^D  —  Ik]]^  does  not  enforce  the  atoms  to  be  normalized. 
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The  surrogate  is  obtained  by  a  first  order  expansion  of  the  logarithmic  terms  in  /(•) 

around  the  previous  iterate 

logdV'l  +  /3)  <  logdV'^''”^^ !+/?)+ 

where  '0  denotes  a  generic  element  of  the  matrix  This  optimization  technique  is  a 
special  case  of  the  Local  Linear  Approximation  (lla)  technique  described  in  [32],  where  it 
was  shown  to  converge  to  a  stationary  point.  Discarding  the  constant  terms  and  defining 

we  get  the  following  sequence  of  subproblems: 

(13) 

We  set  0!°^  =  0,  so  that  the  first  iterate  is  the  solution  to  the  unmixed  model  (1)  with 
A  =  2(7^ (k  +  l)//3,  which  we  expect  to  be  a  good  starting  point  for  the  minimization.  For 
z  >  1,  the  problem  in  (13)  corresponds  to  a  weighted  version  of  (1),  which  can  be  solved 
with  any  £i  solver. 


Vj/(z)  = 


arg  min 
'I' 


IX- 


N  K 

j=i  i=i 


Dictionary  update  (du)  Given  a  current  estimation  for  A,  a  popular  choice  for  the 
dictionary  update  step,  used  in  many  current  state  of  the  art  applications  (e.g.  [21]),  is  the 
Method  of  Optimal  Directions  (mod)  [12]  (k-SVD  could  be  similarly  used).  The  MOD  updates 
D  using  the  standard  least  squares  estimator  =  X(A^*+^))^(A*^‘+^^(A(*+^))^)“^.  In 

our  case,  the  updated  dictionary  D*  is  obtained  by  taking  the  derivative  Vd/(D)  of  (11) 
with  respect  to  D,  and  solving  Vd/(D)  =  0  for  D.  The  resulting  update  is  called  MOCOD 
(Method  of  Optimal  COherence-COnstrained  Directions), 


(x(A(*+i))^  +  2(C  +  ry)D(‘))  x 

A(*+i)(A(‘+b)'r  +  2C(D(*))^D(‘)  +  2  77diag  ((D(*))'^D(*)) 


(14) 


Note  that  when  C  =  0  and  r]  =  0  the  MOCOD  update  (14)  coincides  with  MOD. 


7  Experimental  results:  Sparse  coding  using  mol 

In  the  following  experiments,  all  images  are  converted  to  grayscale  with  an  intensity  range 
of  [0, 1]  and  then  broken  into  patches  of  size  8x8  pixels,  yielding  data  vectors  of  dimension 
n  =  64.  Similar  results  to  those  shown  here  are  also  obtained  for  other  patch  sizes.  We 
choose  not  to  include  them  due  to  space  constrains. 

The  first  set  of  experiments  studies  the  properties  of  the  MOL  distribution  for  sparse 
coding  only.  Its  properties  with  respect  to  the  whole  dictionary  learning  model  will  be 
discussed  in  Section  8  along  with  the  other  added  terms.  For  these  experiments  we  use  a 
global  dictionary  of  A  =  256  atoms  trained  to  the  Pascal  VOC2006  training  subset^  using 
the  model  (1)  with  A  =  0.1.  These  parameter  values  are  typical  in  sparse  coding  applications 
and  produce  dictionaries  D  that  lead  to  state-of-the-art  results  [1,  20,  22]. 

®http : //pascallin. ecs . soton. ac .uk/ challenges/VOC/databases .htinl#V0C2006 
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(l>) 


(c) 


(d) 


Figure  1:  (a)  Empirical  distribution  of  reconstruction  coefficients  for  image  patches  and  the  best 
fitting  distributions,  (b)  Variation  of  the  Laplacian  parameter  6^  in  for  atoms  (sorted  in 
ascending  6^).  (c)  Differences  between  the  Kullback-Leibler  Divergences  for  the  best  fitting  distri¬ 
butions  computed  per  atom  (see  Section  7.1  for  details),  (d)  Reconstruction  psnr  for  the  proposed 
MOL  and  classical  formulations  for  different  £o  (see  Section  7.3  for  details). 


7.1  MOL  as  a  prior  for  reconstruction  coefficients 

We  now  evaluate  the  goodness  of  fit  of  the  Laplacian  and  MOL  distributions  to  the  empirical 
distribution  of  the  reconstruction  coefficients  A  obtained  by  considering  all  its  elements 
{akj}  as  IID.  We  compute  A  using  bp  to  obtain  an  exact  reconstruction, 

min  III  A||i  s.t.  X  =  DA}  , 

and  then  restrict  our  study  to  the  nonzero  elements  of  A.  X  corresponds  to  all  8x8  patches 
from  the  2600  testing  images  from  the  Pascal  VOC2006  dataset. 

The  empirical  distribution  of  A,  p^,  is  plotted  in  Figure  la  (green  dots)  along  with  the 
best  fitting  Laplacian,  Pl,  (in  blue)  and  MOL,  Pmj  distributions  (in  red).  The  fitting  is  done 
using  the  MLE  estimator  for  the  Laplacian  parameter,  9  =  N/  ||  A||  and  for  MOL  we  fix  k  =  3 
and  compute  (3  using  the  method  of  moments,  which  gives  (3  =  {n  —  1)  ||A||^  /N.  The  best 
fitting  distributions  were  obtained  for  0  =  55  and  (3  =  0.05  respectively.  Visual  inspection  of 
Figure  la  reveals  a  much  better  fitting  of  MOL  to  the  empirical  distribution.  For  an  objective 
measure  of  the  goodness  of  fit  we  use  the  standard  Kullback-Leibler  Divergence  (KLD) 
between  p^  and  each  of  the  fitted  distributions  Pl  and  Pm,  yielding  KLD(pe,Pl)  =  0.30  bits 
and  KLD(pe,Pm)  =  0.04,  confirming  the  improved  fitting  properties  of  MOL.  The  empirical 
entropy  is  H{p^)  =  3.00  bits. 

Figure  lb  shows  that  the  optimal  Laplacian  parameter  9^  varies  greatly  with  each  atom 
(sorted  in  ascending  9^).  This  justifies  the  following  experiment,  where  different  Laplacian 
and  MOL  distributions  are  fitted  to  each  k  row  of  A,  A^,  each  corresponding  to  the  fc-th  atom 
in  D.  We  then  compare  the  fitting  obtained  by  the  row-specific  empirical  distributions,  Pg, 
against  the  row-specific  fitted  Laplacian,  p}',  and  MOL,  p^,  and  against  the  globally-fitted 
ones  Pl  and  MOL  Pm.  Figure  Ic  plots  the  difference  between  the  KLDs  for  the  computed 
distributions  and  KLD(pe,Pm),  with  the  horizontal  axis  sorted  by  increasing  KLD(pe,Pl)  — 
KLD(pe,Pm).  Clearly,  the  distributions  of  the  same  type  (Laplacian  or  MOl)  fitted  to  each 
row  should  be  at  least  as  accurate  for  the  statistics  of  that  row  as  the  globally  fitted  ones, 
and  this  can  be  verified  in  Figure  Ic.  We  then  observe  that  Pm  is  significantly  better  than 


Pl  for  every  k.  Furthermore,  we  observe  that  in  251  out  of  256  cases  the  global  pu  performs 
better  than  the  row-specific  p^.  Finally,  in  all  cases,  the  row-specific  are  better  than  the 
corresponding  pj  by  an  approximately  constant,  significant  margin. 

We  conclude  that  mol,  with  the  same  number  of  free  parameters  than  a  Laplacian,  is 
significantly  better  for  the  probabilistic  modeling  of  reconstruction  coefficients.  This  also 
holds  for  row-wise  elements,  providing  further  evidence  that  the  univeral  approach  is  a 
suitable  strategy  for  handling  the  non-stationary  nature  of  image  patch  statistics.  Finally, 
the  fact  that  a  single  globally  fitted  mol  distribution  performs  better,  in  almost  all  cases, 
than  K  Laplacians  fitted  specifically  to  each  row,  shows  that  our  approach  is  also  more 
accurate,  with  only  one  parameter,  than  a  weighted  model  with  K  parameters. 

Wether  these  significantly  improved  fittings  have  a  practical  impact  on  the  performance 
of  the  sparse  coding  stage  is  explored  in  the  following  experiments. 

7.2  Active  set  recovery 

For  the  following  experiments  we  consider  data,  available  as  part  of  the  SIPI  database,^ 
consisting  of  the  six  widely  used  grayscale  images  Barbara,  Boats,  Baboon,  Goldhill,  Lena 
and  Peppers.  We  perform  an  empirical  study  on  the  active  set  recovery  properties  of  the 
MOL  prior  compared  to  those  of  the  £i-based  one.  To  this  end,  we  first  decompose  the 
images  of  the  dataset  into  non-overlapping  8x8  patches  and  obtain  sparse  approximations 
to  them  using  OMP  for  different  considered  target  sparsity  levels  io-  This  way  we  have  a 
set  of  signals  which  have  a  true  sparse  representation  under  the  dictionary  D,  and  whose 
active  set  can  be  used  as  a  ground  truth.  For  each  target  maximum  £q  norm  we  do  an 
exact  reconstruction  of  the  patches  using  BP  (corresponding  to  the  £i  cost)  and  the  “BP 
equivalent”  for  the  MOL  prior. 


(  N  K  ] 

A*  =  argmin  sEElog(l«bl+/9)  s.t.  X  =  DA^. 

We  then  measure  the  accuracy  of  the  recovery  of  each  method  as  the  percentage  of  cases 
'H{ng)  in  which  the  size  of  the  symmetric  difference  between  the  true  and  recovered  active 
sets,  |(A  \  S)  U  (B  \  A)|,  is  no  larger  than  a  certain  target  value  n^.  In  order  to  quantify  how 
this  accuracy  relates  to  the  final  reconstruction  quality  of  the  patches,  we  measure  the  recon¬ 
struction  PSNR  obtained  using  an  Ordinary  Least  Squares  (OLS)  approximation  restricted 
to  the  recovered  active  sets  (this  is  of  course  the  appropriate  reconstruction  procedure  once 
the  active  set  is  determined). 

Table  la  shows  these  results  for  different  maximum  £q  values.  As  can  be  observed,  in  all 
cases  the  MOL-based  recovery  is  significantly  more  successful  in  recovering  the  true  active 
sets,  and  this  translates  into  a  much  better  reconstruction  performance. 

7.3  Sparse  coding  performance  of  MOL 

Given  the  previous  results,  we  expect  sparse  coding  based  on  the  MOL  prior  (8)  to  out¬ 
perform  the  one  obtained  using  an  £i  prior  (1).  Clearly  both  priors  have  different  forms, 
whose  parameters  play  different  roles,  and  thus  comparing  the  reconstruction  psnr  of  both 
formulations  needs  to  be  done  by  carefully  matching  the  regularization  strength  of  the  ti 

^http : //sipl .use . edu/database/ 
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and  MOL  terms.  Since  MOL  is  a  mixture  of  Laplacians,  one  such  way  is  to  fix  the  same  value 
of  cr^  in  both  formulations  and  let  A  =  2a^9  where  6  =  E[9]  is  the  expected  value  of  9  under 
the  Gamma  distribution  used  as  a  prior  on  9  for  computing  the  mol  distribution. 

For  this  case  we  encode  the  SIPI  subset  by  solving  both  and  MOL-based  sparse  coding 
problems.  We  use  k  =  3  and  the  optimal  scale  parameter  /3  =  0.05  found  in  Section  7.1  for 
(8)  and  a  range  of  values  of  0  =  {0.3,  0.6, 1.3, 1.6}  9  where  9  =  k/P  is  the  expected  value  of 
9  corresponding  to  /3,  as  discussed  in  Section  4.  In  both  cases  we  set  a  =  3%  of  the  peak 
value  in  order  to  define  A  in  (1)  and  r  in  (8).  In  order  to  better  observe  the  differences 
in  performance,  we  truncate  the  solutions  to  have  a  fixed  maximum  £q  norm  and  compute 
the  reconstruction  using  OLS  based  on  the  columns  of  D  chosen  by  the  active  set  only.  The 
results  in  Figure  Id  show  that  we  indeed  have  a  better  sparse  reconstruction  with  a  single 
value  of  P  than  any  value  of  A,  thus  confirming  the  advantages  of  using  mol  in  this  case  as 
well. 


8  Experimental  results:  Testing  mocod 

We  now  show  that  the  complete  proposed  model  (Section  5)  improves  both  the  reconstruc¬ 
tion  accuracy  and  the  speed  of  leading  sparse  coding  techniques. 

8.1  Reconstruction  and  generalization  properties 

The  following  experiment  deals  with  the  generalization  properties  of  globally  trained  dic¬ 
tionaries.  In  this  case  we  expect  that  the  designed  incoherence,  which  can  be  seen  as  a 
constrain  on  how  close  the  different  learned  atoms  are  in  reduces  overfitting  to  the 
training  data  by  avoiding  the  clumping  of  atoms  in  regions  where  there  are  exceptional 
concentrations  of  similar  training  vectors. 

We  take  the  images  from  the  SIPI  subset  and  perform  a  leave- one- out  cross-validation 
procedure  where  each  image  is  encoded  using  a  dictionary  trained  with  the  other  images. 
For  training,  a  small  amount  of  white  Gaussian  noise  with  standard  deviation  a  =  2%  of 
the  peak  pixel  value  is  added  so  that  the  parameter  cr^  that  defines  A  in  (1)  and  r  in  (8)  is 
known.  The  initial  dictionary  is  the  one  used  in  Section  7.  After  the  dictionary  is  learned, 
the  patches  of  the  target  image  are  approximated  using  OMP  with  a  maximum  of  Iq  =  12 
nonzero  reconstruction  coefficients  per  patch.  The  image  patches  in  the  reconstruction  step 
are  non-overlapping.®  The  mol  parameters  were  set  to  /c  =  3  and  the  global  optimum  found 
in  Section  7.1  P  =  0.05,  and  9  =  (0.5, 1.0, 1.6}  0,  where  0  =  n/p. 

The  results  are  shown  in  Table  lb,  where  we  compare  MOCOD  against  MOD,  showing  that 
dictionaries  trained  with  MOCOD  yield  a  significant  average  improvement  of  up  to  1  dB  in 
reconstruction  psnr,  thus  providing  evidence  of  the  improved  generalization  properties.  In 
all  cases  we  see  that  the  Gram  matrix  norm  p(D)  and  the  cumulative  coherence  /1(D)  are 
significantly  reduced  as  expected.  We  omit  a  comparison  to  k-SVD  [1]  since  it  was  observed 
in  [21]  that  both  MOD  and  k-SVD  give  very  similar  results  in  general. 

®For  reconstruction,  specially  for  denoising,  state-of-the  art  algorithms  rely  on  overlapping.  We  use  non¬ 
overlapping  patches  in  these  experiment  so  that  differences  in  the  quality  of  the  reconstructed  patches  are 
not  averaged  out. 
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io 

n-e 

SC 

n/  /  \  OLS 

Hine)  DU 

SC 

P  M 

OMP 

PSNR 

1ST 

t  (sec.)  PSNR 

3 

0 

il 

MOL 

35.6  37.4  MOD 

71.1  42.6  MOCOD 

(0  =  150) 

15.8  39.5 

7.5  24.9 

38.7 

39.1 

8.3  33.8 

4.9  34.0 

5 

1 

il 

MOL 

10.6  36.9  MOD 

43.2  42.2  MOCOD 

[0  =  60) 

11.5  31.3 

7.1  24.7 

38.0 

39.0 

7.2  33.6 

4.7  33.9 

8 

2 

il 

MOL 

7.6  37.6  MOD 

30.8  42.3  MOCOD 

(0  =  30) 

15.1  36.5 

6.8  24.4 

38.4 

38.9 

8.1  33.8 

4.5  33.8 

(a)  “OD 

MOCOD 

MOL  (/3  =  0.05) 

20.0  43.2 

7.0  24.6 

38.6 

39.0 

9.7  33.8 

4.8  33.9 

(b) 


Table  1:  (a)  Accuracy  of  the  recovery  of  active  set  and  its  psnr  (see  text,  Section  7.2).  (b) 

Reconstruction  properties.  The  best  results  for  each  case  are  in  bold.  The  values  of  /3,  6,  p  and  p, 
and  the  OMP  psnr  (in  dB)  columns  correspond  to  the  experiment  of  Section  8.1,  while  the  last  two 
columns  are  for  the  1ST  performance  results  of  Section  8.2. 


8.2  Improved  computational  efficiency  of  shrinkage  methods  and 
active  set  recovery 

As  mentioned  in  Section  3,  the  convergence  rate  of  Iterated  Shrinkage/ Thresholding  [7],  and 
of  state-of-the  art  sparse  coding  methods  based  on  it  (see  [11]  for  a  review),  are  known 
to  depend  on  the  Gram  matrix  norm  p(D).  Since  mocod  induces  a  reduction  in  p(D), 
it  is  expected  that  dictionaries  learned  with  MOCOD  lead  to  a  faster  convergence  than  the 
ones  obtained  without  the  additional  coherence  penalty.  The  last  two  columns  in  Table  lb 
account  for  the  running  time  and  final  PSNR  obtained  using  1ST  for  a  fixed  value  of  A  =  0.1 
and  the  corresponding  D.®  The  coding  time  obtained  with  the  dictionaries  trained  with 
MOCOD  is  consistently  reduced  by  an  amount  roughly  proportional  to  the  reduction  in 
P(D),  while  the  PSNR  remains  approximately  the  same  (or  improves  slightly). 

Finally,  we  repeated  the  experiment  in  Section  7.2  using  the  complete  MOCOD  and  ob¬ 
served  further  significant  improvements  of  the  values  in  the  two  rightmost  columns  in  Fig¬ 
ure  la,  for  example,  an  improvement  of  61%  in  H{n/  for  £q  =  8,  =  2,  and  of  2.7dB  in 

PSNR  for  £o  =  3.  This  being  a  direct  consequence  of  reducing  ^(D),  it  further  shows  that  the 
incoherence  term,  on  top  of  MOL,  is  very  important  for  efficient  and  accurate  reconstruction 
of  both  the  active  set  and  the  signal  itself. 

9  Concluding  remarks 

A  new  sparse  model  was  introduced  in  this  work.  The  model  includes  two  new  terms. 
The  first  new  component  encourages  intrinsic  properties  of  the  learning  dictionary  which 
are  critical  for  sparse  coding  and  generalization.  The  second  term  replaces  the  classical 
sparsifying  penalties  by  a  logarithmic  one  formally  derived  following  the  framework  of  MDL. 
Both  the  theoretical  and  practical  advantages  of  such  new  model  were  presented. 

The  critical  properties  of  the  proposed  model,  such  as  increased  stability  of  the  active  set 
and  improved  generalization,  hint  to  the  possible  implications  of  this  model  for  classification 

^Timings  were  obtained  using  a  single-threaded  C  implementation  compiled  with  GCC  4.3.2,  on  a  Lenovo 
T400  with  Core2  Duo  T9400  (at  full  speed)  running  Linux  2.6.27. 
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tasks  such  as  those  described  in  [20] .  We  are  currently  investigating  this  and  results,  together 
with  the  theoretical  details  on  the  derivations  here  presented,  will  be  reported  elsewhere. 
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